Case-sensitive letter and bigram frequency counts from large-scale English corpora.

Authors

  • Michael N Jones
  • D J K Mewhort
Abstract

We tabulated upper- and lowercase letter frequency using several large-scale English corpora (approximately 183 million words in total). The results indicate that the relative frequencies for upper- and lowercase letters are not equivalent. We report a letter-naming experiment in which uppercase frequency predicted response time to uppercase letters better than did lowercase frequency. Tables of case-sensitive letter and bigram frequency are provided, including common nonalphabetic characters. Because subjects are sensitive to frequency relationships among letters, we recommend that experimenters use case-sensitive counts when constructing stimuli from letters.
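The kind of tabulation the abstract describes can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names, the sample sentence, and the choice to count bigrams only within whitespace-delimited tokens are assumptions made for illustration. Characters are never case-folded, so `T` and `t` are counted separately, and nonalphabetic characters such as `.` are kept.

```python
from collections import Counter


def count_letters_and_bigrams(text):
    """Return case-sensitive (unigram, bigram) Counters for `text`.

    Assumption for this sketch: whitespace is not counted, and bigrams
    are formed only inside whitespace-delimited tokens.
    """
    unigrams = Counter(ch for ch in text if not ch.isspace())
    bigrams = Counter()
    for token in text.split():
        # Pair each character with its right-hand neighbour.
        for first, second in zip(token, token[1:]):
            bigrams[first + second] += 1
    return unigrams, bigrams


def relative_frequencies(counts, keep):
    """Normalise the counts whose key satisfies `keep` so they sum to 1."""
    selected = {key: n for key, n in counts.items() if keep(key)}
    total = sum(selected.values())
    if total == 0:
        return {}
    return {key: n / total for key, n in selected.items()}


# Hypothetical sample text, standing in for a real corpus.
sample = "The cat sat. The Tin man ran."
unigrams, bigrams = count_letters_and_bigrams(sample)

# Separate distributions for uppercase and lowercase letters.
upper = relative_frequencies(unigrams, str.isupper)
lower = relative_frequencies(unigrams, str.islower)
```

Normalising uppercase and lowercase letters separately is what allows the two relative-frequency distributions to be compared, which is the comparison the abstract reports as non-equivalent.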

Similar resources

The Frequency and Versatility of Initial and Terminal Letters in English Words

A review of previous word and letter counts, together with the applications of these counts, was reported. A comprehensive count of initial and terminal letters and bigrams was compiled based on the Kučera and Francis (Computational analysis of present-day American English. Providence: Brown Univ. Press, 1967) corpus of English words. The count included frequency of occurrence and versatility, ...

Correlated Bigram LSA for Unsupervised Language Model Adaptation

We present a correlated bigram LSA approach for unsupervised LM adaptation for automatic speech recognition. The model is trained using efficient variational EM and smoothed using the proposed fractional Kneser-Ney smoothing which handles fractional counts. We address the scalability issue to large training corpora via bootstrapping of bigram LSA from unigram LSA. For LM adaptation, unigram and...

Typing Letter Strings Varying in Orthographic

Subjects typed six-letter strings varying in orthographic structure. Lexical status, word frequency, position-sensitive log bigram frequency, and regularity of letter sequencing were systematically varied. Cumulative reaction times (RTs) of the keystrokes were adequately described by a linear function of letter position in the test string. Overall, words were typed faster than nonwords, and reg...

Deriving a bi-lingual dictionary from raw transcription data

We present a bigram-based method for deriving bi-lingual dictionary entries from two corpora of spontaneous speech (as represented in transcriptions). In contrast to e.g. [1], our method does not require translated or otherwise aligned texts; the corpora representing the source and target languages may be unrelated wrt. size, vocabulary richness, frequency distribution, and activity type. Examp...

Detecting DNS Tunnels Using Character Frequency Analysis

High-bandwidth covert channels pose significant risks to sensitive and proprietary information inside company networks. Domain Name System (DNS) tunnels provide a means to covertly infiltrate and exfiltrate large amounts of information past network boundaries. This paper explores the possibility of detecting DNS tunnels by analyzing the unigram, bigram, and trigram character frequencies of do...

Journal:
  • Behavior research methods, instruments, & computers : a journal of the Psychonomic Society, Inc

Volume 36, Issue 3

Pages: -

Publication date: 2004